Exploratory Analysis

After reading this page, you will have an overall sense about the crimes around Columbia University campus (including CUIMC). We pictured the proportion of each level of crimes, and the charactristic of suspects and victims. We also focused on the total crime numbers across the years, showing a pattern of the fluctation of offense number with the months and times, so that we can have an idea about when it will be more dangerous and need more caution. At the same time, we also captured some interesting point of offense number.

Overview about offenses around CU campus

filter_data = read_csv("../data/full_filter_data.csv")
## Rows: 291052 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (11): boro, success_fail, level, description, location, susp_age, susp_r...
## dbl  (9): id, year, month, day, hour, minute, second, latitude, longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
p <- c() # a list to save plots
p[[1]] <- ggplot(aes(x = level, fill = level), data = filter_data) + 
  geom_bar() +
  labs(x = "Crime Level", y = "Number") + guides(fill = "none")
p[[2]] <- ggplot(aes(x = success_fail, fill = success_fail), data = 
                  filter_data)+
    geom_bar()+
  labs(x = "Crime Completed or Not", y = "Number")+ guides(fill = "none")
p[[1]] + p[[2]]

The number of misdemeanor crime is the most (62.1%), while the number of felony is approximately half of misdemeanor (28.7%). The least type is violation (10%). Here exists a selection bias, most of people would not report a violation to the police. This figure does not indicate violation crimes is the least type in reality. At the meanwhile, nearly all (98%) of the crimes are completed – do not put your faith in luck that someone can help you, try to avoid them as much as you can! Next, we focus on the portrait of suspects and victims.

To have a closer look to the different levels, we listed top 5 descriptions from different level.

p[[3]] <- filter_data %>% 
  filter(susp_age %in% c("<18", "18-24", "25-44", "45-64", "65+",
                         "UNKNOWN")) %>%
  mutate(susp_age = as.factor(susp_age)) %>% 
  count(susp_age, level) %>% 
  plot_ly(x = ~ susp_age, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>% 
  layout(title = "Suspects' Age Group", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[4]] <- filter_data %>% 
  filter (susp_race != "(null)") %>% 
  mutate(susp_age = as.factor(susp_race)) %>% 
  count(susp_race, level) %>% 
  plot_ly(x = ~ susp_race, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>% 
  layout(title = "Suspects' Race", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[5]] <- filter_data %>% 
  filter (susp_sex != "(null)")  %>% 
  mutate(susp_sex = as.factor(susp_sex)) %>% 
  count(susp_sex, level) %>% 
  mutate(susp_sex = recode(susp_sex, U = "Unknown", F = "Female", "M" = "Male")) %>% 
  plot_ly(x = ~ susp_sex, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>% 
  layout(title = "Suspects' Sex", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[6]] <- filter_data %>% 
  filter(vic_age %in% c("<18", "18-24", "25-44", "45-64", "65+",
                         "UNKNOWN")) %>%
  mutate(vic_age = as.factor(vic_age)) %>% 
  count(vic_age, level) %>% 
  plot_ly(x = ~ vic_age, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>% 
  layout(title = "Victims' Age Group", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[7]] <- filter_data %>% 
  filter (vic_race != "(null)") %>% 
  mutate(vic_age = as.factor(vic_race)) %>% 
  count(vic_race, level) %>% 
  plot_ly(x = ~ vic_race, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>% 
  layout(title = "Victims' Race", xaxis = list(title = ""), yaxis = list(title = "Number"))
p[[8]] <- filter_data %>% 
  filter (vic_sex %in% c("D", "E", "F", "M"))  %>% 
  mutate(vic_sex = as.factor(vic_sex)) %>% 
  count(vic_sex, level) %>% 
  mutate(vic_sex = recode(vic_sex, F = "Female", "M" = "Male")) %>% 
  plot_ly(x = ~ vic_sex, y = ~n, type = "bar", color = ~ level, colors = "viridis") %>% 
  layout(title = "Victims' Sex", xaxis = list(title = ""), yaxis = list(title = "Number"))

Through the data, we can see that except unknown, suspects’ age is centered at 25-44 years old, black man. As for the victim, the age group is still centered at 25-44 years old. But the race is more evenly spread, and female number is much more than the suspect group.

Next, we want to look into the fluctuation of crime cases among months and hours, to see if the case number have a correlation with time.

Month data

Total crimes
data <- filter_data
month_data <- data %>%  
  group_by(year, month) %>% 
  summarise(number = n()) %>% 
  mutate(month = as.factor(month),
         year = as.factor(year))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
month_data %>% 
  plot_ly(y = ~number, color = ~month, type = "box", colors = "viridis") %>% 
  layout(title = "Crime Number Across Months", xaxis = list(title = "Month"), yaxis = list(title = "Number"))

We can roughly see that, in summer there are more crimes, especially from May to August. December and February have fewer cases. We will verify if there is a significant difference between months in Statistical testing part.

Hours
hour_data = data %>% 
  group_by(year, month, hour) %>% 
  summarise(number = n()) %>%
  mutate(hour = as.factor(hour),
    hour = fct_inseq(hour)) 
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
hour_data %>% 
  plot_ly(y = ~number, color = ~hour, type = "box", colors = "viridis") %>% 
  layout(title = "Crime Number Across Hours", xaxis = list(title = "Hour"), yaxis = list(title = "Number"))

Obvious pattern. It’s counter-intuitive that the time with most cases is not in the midnight, but in the afternoon.

what about specific days?

data %>% 
  group_by(month, day) %>% 
  summarise(number = n()) %>% 
  arrange(desc(number)) %>% 
  head( 5) %>% 
  knitr::kable()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
month day number
1 1 1196
6 1 1084
7 1 1054
5 1 1051
9 1 1041
data %>% 
  group_by(month, day) %>% 
  summarise(number = n()) %>% 
  arrange((number)) %>%
  filter(month != 2 & day != 29) %>% 
  head(5) %>% 
  knitr::kable()
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
month day number
12 25 370
12 24 493
12 26 499
12 31 513
12 30 562

The crime numbers in each day across years also varies a lot. It’s a interesting finding that New Years’ day enjoys the most crime cases, and the second is International Children’s Day (sadly). The days with fewest crimes is around Chirstmas Holiday, when most people will stay safely with family at home.